I. Project Description

II. Preparation Steps

1. Libraries used:

library(tidyverse) # core packages (ggplot2, dplyr, tidyr, readr, purrr, tibble) that work together to clean, process, model, and visualize data
library(skimr) # skim() for compact data summaries; I like this library a lot
library(mice) # md.pattern() gives a better understanding of the pattern of missing data
library(VIM) # aggr() gives a more helpful visual representation of missing data
library(naniar) # gg_miss_var() for missing values; see https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html
library(mlbench) # artificial and real-world machine learning benchmark problems, including several data sets from the UCI repository (also has BostonHousing)
library(caret) # model training and evaluation
library(mlr) # machine learning framework
library(ggthemes) # extra themes for ggplot2
library(gplots) # additional plotting tools
library(randomForest) # random forest models
library(corrplot) # correlation matrix plots
library(kableExtra) # table formatting
library(plotly) # interactive plots
library(GGally) # for ggpairs()

2. Loading the data:

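# stringsAsFactors = TRUE keeps text columns as factors (as in the skim() output below), including on R >= 4.0.0
rawdata <- read.csv("https://github.com/hnguye01/hnguye01.github.io/raw/master/DS6306/Data/CaseStudy2-data.csv", stringsAsFactors = TRUE)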
head(rawdata)
##   ID Age Attrition    BusinessTravel DailyRate             Department
## 1  1  32        No     Travel_Rarely       117                  Sales
## 2  2  40        No     Travel_Rarely      1308 Research & Development
## 3  3  35        No Travel_Frequently       200 Research & Development
## 4  4  32        No     Travel_Rarely       801                  Sales
## 5  5  24        No Travel_Frequently       567 Research & Development
## 6  6  27        No Travel_Frequently       294 Research & Development
##   DistanceFromHome Education   EducationField EmployeeCount EmployeeNumber
## 1               13         4    Life Sciences             1            859
## 2               14         3          Medical             1           1128
## 3               18         2    Life Sciences             1           1412
## 4                1         4        Marketing             1           2016
## 5                2         1 Technical Degree             1           1646
## 6               10         2    Life Sciences             1            733
##   EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1                       2   Male         73              3        2
## 2                       3   Male         44              2        5
## 3                       3   Male         60              3        3
## 4                       3 Female         48              3        3
## 5                       1 Female         32              3        1
## 6                       4   Male         32              3        3
##                  JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1        Sales Executive               4      Divorced          4403
## 2      Research Director               3        Single         19626
## 3 Manufacturing Director               4        Single          9362
## 4        Sales Executive               4       Married         10422
## 5     Research Scientist               4        Single          3760
## 6 Manufacturing Director               1      Divorced          8793
##   MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
## 1        9250                  2      Y       No                11
## 2       17544                  1      Y       No                14
## 3       19944                  2      Y       No                11
## 4       24032                  1      Y       No                19
## 5       17218                  1      Y      Yes                13
## 6        4809                  1      Y       No                21
##   PerformanceRating RelationshipSatisfaction StandardHours
## 1                 3                        3            80
## 2                 3                        1            80
## 3                 3                        3            80
## 4                 3                        3            80
## 5                 3                        3            80
## 6                 4                        3            80
##   StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## 1                1                 8                     3               2
## 2                0                21                     2               4
## 3                0                10                     2               3
## 4                2                14                     3               3
## 5                0                 6                     2               3
## 6                2                 9                     4               2
##   YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## 1              5                  2                       0
## 2             20                  7                       4
## 3              2                  2                       2
## 4             14                 10                       5
## 5              6                  3                       1
## 6              9                  7                       1
##   YearsWithCurrManager
## 1                    3
## 2                    9
## 3                    2
## 4                    7
## 5                    3
## 6                    7
view(rawdata) # there are 870 entries and 36 columns in total
length(rawdata) # number of columns
## [1] 36
skim(rawdata) # data summary
Data summary
Name rawdata
Number of rows 870
Number of columns 36
_______________________
Column type frequency:
factor 9
numeric 27
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Attrition 0 1 FALSE 2 No: 730, Yes: 140
BusinessTravel 0 1 FALSE 3 Tra: 618, Tra: 158, Non: 94
Department 0 1 FALSE 3 Res: 562, Sal: 273, Hum: 35
EducationField 0 1 FALSE 6 Lif: 358, Med: 270, Mar: 100, Tec: 75
Gender 0 1 FALSE 2 Mal: 516, Fem: 354
JobRole 0 1 FALSE 9 Sal: 200, Res: 172, Lab: 153, Man: 87
MaritalStatus 0 1 FALSE 3 Mar: 410, Sin: 269, Div: 191
Over18 0 1 FALSE 1 Y: 870
OverTime 0 1 FALSE 2 No: 618, Yes: 252

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1 435.50 251.29 1 218.25 435.5 652.75 870 ▇▇▇▇▇
Age 0 1 36.83 8.93 18 30.00 35.0 43.00 60 ▂▇▇▃▂
DailyRate 0 1 815.23 401.12 103 472.50 817.5 1165.75 1499 ▇▇▇▇▇
DistanceFromHome 0 1 9.34 8.14 1 2.00 7.0 14.00 29 ▇▅▂▂▂
Education 0 1 2.90 1.02 1 2.00 3.0 4.00 5 ▂▅▇▆▁
EmployeeCount 0 1 1.00 0.00 1 1.00 1.0 1.00 1 ▁▁▇▁▁
EmployeeNumber 0 1 1029.83 604.79 1 477.25 1039.0 1561.50 2064 ▇▇▇▇▇
EnvironmentSatisfaction 0 1 2.70 1.10 1 2.00 3.0 4.00 4 ▅▆▁▇▇
HourlyRate 0 1 65.61 20.13 30 48.00 66.0 83.00 100 ▇▇▆▇▇
JobInvolvement 0 1 2.72 0.70 1 2.00 3.0 3.00 4 ▁▃▁▇▁
JobLevel 0 1 2.04 1.09 1 1.00 2.0 3.00 5 ▇▇▃▂▁
JobSatisfaction 0 1 2.71 1.11 1 2.00 3.0 4.00 4 ▅▅▁▇▇
MonthlyIncome 0 1 6390.26 4597.70 1081 2839.50 4945.5 8182.00 19999 ▇▅▂▁▁
MonthlyRate 0 1 14325.62 7108.38 2094 8092.00 14074.5 20456.25 26997 ▇▇▇▇▇
NumCompaniesWorked 0 1 2.73 2.52 0 1.00 2.0 4.00 9 ▇▃▂▂▁
PercentSalaryHike 0 1 15.20 3.68 11 12.00 14.0 18.00 25 ▇▅▃▂▁
PerformanceRating 0 1 3.15 0.36 3 3.00 3.0 3.00 4 ▇▁▁▁▂
RelationshipSatisfaction 0 1 2.71 1.10 1 2.00 3.0 4.00 4 ▅▅▁▇▇
StandardHours 0 1 80.00 0.00 80 80.00 80.0 80.00 80 ▁▁▇▁▁
StockOptionLevel 0 1 0.78 0.86 0 0.00 1.0 1.00 3 ▇▇▁▂▁
TotalWorkingYears 0 1 11.05 7.51 0 6.00 10.0 15.00 40 ▇▇▂▁▁
TrainingTimesLastYear 0 1 2.83 1.27 0 2.00 3.0 3.00 6 ▂▇▇▂▃
WorkLifeBalance 0 1 2.78 0.71 1 2.00 3.0 3.00 4 ▁▃▁▇▂
YearsAtCompany 0 1 6.96 6.02 0 3.00 5.0 10.00 40 ▇▃▁▁▁
YearsInCurrentRole 0 1 4.20 3.64 0 2.00 3.0 7.00 18 ▇▃▂▁▁
YearsSinceLastPromotion 0 1 2.17 3.19 0 0.00 1.0 3.00 15 ▇▁▁▁▁
YearsWithCurrManager 0 1 4.14 3.57 0 2.00 3.0 7.00 17 ▇▂▅▁▁

So the dataset has 870 observations and 36 variables.
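The skim() output above reports nine factor columns, which depends on how read.csv() treats text. Since R 4.0.0 the default is stringsAsFactors = FALSE, so on a recent R version the same columns would load as character unless factors are requested. The sketch below illustrates this on a tiny inline CSV; the three rows are made-up stand-ins, not the real data.

```r
# Toy CSV standing in for the real file (hypothetical values).
csv_text <- "Age,Attrition,Department
32,No,Sales
40,Yes,Research & Development
35,No,Sales"

# Default behaviour: on R >= 4.0.0 text columns stay character.
as_default <- read.csv(text = csv_text)

# Explicitly request factors, matching what skim() reports above.
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_fct$Attrition)  # "factor"
dim(as_fct)              # rows first, then columns: 3 3
```

Note that dim() returns both the row and column counts, whereas length() on a data frame returns only the number of columns.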

3. Checking for missing data:

The skim(rawdata) output already shows that the dataset has no missing data (n_missing is 0 for every variable). For reference, I will still show some other functions that can be used to check for missing data. Any one of them is enough on its own.

md.pattern(rawdata)
##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'

##     ID Age Attrition BusinessTravel DailyRate Department DistanceFromHome
## 870  1   1         1              1         1          1                1
##      0   0         0              0         0          0                0
##     Education EducationField EmployeeCount EmployeeNumber
## 870         1              1             1              1
##             0              0             0              0
##     EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 870                       1      1          1              1        1
##                           0      0          0              0        0
##     JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## 870       1               1             1             1           1
##           0               0             0             0           0
##     NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating
## 870                  1      1        1                 1                 1
##                      0      0        0                 0                 0
##     RelationshipSatisfaction StandardHours StockOptionLevel
## 870                        1             1                1
##                            0             0                0
##     TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 870                 1                     1               1              1
##                     0                     0               0              0
##     YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager  
## 870                  1                       1                    1 0
##                      0                       0                    0 0
aggr_plot <- aggr(rawdata, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(rawdata), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##                  Variable Count
##                        ID     0
##                       Age     0
##                 Attrition     0
##            BusinessTravel     0
##                 DailyRate     0
##                Department     0
##          DistanceFromHome     0
##                 Education     0
##            EducationField     0
##             EmployeeCount     0
##            EmployeeNumber     0
##   EnvironmentSatisfaction     0
##                    Gender     0
##                HourlyRate     0
##            JobInvolvement     0
##                  JobLevel     0
##                   JobRole     0
##           JobSatisfaction     0
##             MaritalStatus     0
##             MonthlyIncome     0
##               MonthlyRate     0
##        NumCompaniesWorked     0
##                    Over18     0
##                  OverTime     0
##         PercentSalaryHike     0
##         PerformanceRating     0
##  RelationshipSatisfaction     0
##             StandardHours     0
##          StockOptionLevel     0
##         TotalWorkingYears     0
##     TrainingTimesLastYear     0
##           WorkLifeBalance     0
##            YearsAtCompany     0
##        YearsInCurrentRole     0
##   YearsSinceLastPromotion     0
##      YearsWithCurrManager     0
gg_miss_var(rawdata, show_pct = TRUE) + labs(title = "Percent missing of the data") + theme(legend.position = "none", plot.title = element_text(hjust = 0.5), axis.title.y = element_text(angle = 0, vjust = 1))

So the dataset has no missing data.
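If none of the packages above is available, the same check can be done with base R alone. This is a minimal sketch on a made-up data frame (the values and the NA positions are assumptions for illustration); on the real data, every count would be 0.

```r
# Toy data frame with two deliberately missing values (hypothetical).
toy <- data.frame(
  Age        = c(32, NA, 35),
  Department = c("Sales", "Sales", NA),
  JobLevel   = c(2, 5, 3)
)

# Is there any missing value at all?
anyNA(toy)                       # TRUE

# Number of missing values per column.
na_counts <- colSums(is.na(toy))
na_counts[na_counts > 0]         # only the columns that have NAs

# Share of rows with no missing value in any column.
mean(complete.cases(toy))
```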

4. Dropping unused columns:

We can see from skim() or view() that some columns have no variation, so we can drop them without affecting our analysis. In the skim() output, Over18 is Y for all 870 observations, EmployeeCount is 1 for all 870 observations, and StandardHours is 80 for all 870 observations. This is plausible: 18 is a standard minimum working age, and 80 hours would be high for a single week, so it probably refers to a two-week pay period. We can therefore drop these three columns.

# Indices of the columns that contain a single unique value
drop_columns <- which(apply(rawdata, 2, function(x) (length(unique(x)) == 1)))

cols <- names(drop_columns) # names of the constant columns
rawdata <- rawdata[,-drop_columns]

# Alternatively, we can drop the columns by name: rawdata <- select(rawdata, -c("Over18","EmployeeCount", "StandardHours")). The result is the same.

skim(rawdata)
Data summary
Name rawdata
Number of rows 870
Number of columns 33
_______________________
Column type frequency:
factor 8
numeric 25
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Attrition 0 1 FALSE 2 No: 730, Yes: 140
BusinessTravel 0 1 FALSE 3 Tra: 618, Tra: 158, Non: 94
Department 0 1 FALSE 3 Res: 562, Sal: 273, Hum: 35
EducationField 0 1 FALSE 6 Lif: 358, Med: 270, Mar: 100, Tec: 75
Gender 0 1 FALSE 2 Mal: 516, Fem: 354
JobRole 0 1 FALSE 9 Sal: 200, Res: 172, Lab: 153, Man: 87
MaritalStatus 0 1 FALSE 3 Mar: 410, Sin: 269, Div: 191
OverTime 0 1 FALSE 2 No: 618, Yes: 252

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1 435.50 251.29 1 218.25 435.5 652.75 870 ▇▇▇▇▇
Age 0 1 36.83 8.93 18 30.00 35.0 43.00 60 ▂▇▇▃▂
DailyRate 0 1 815.23 401.12 103 472.50 817.5 1165.75 1499 ▇▇▇▇▇
DistanceFromHome 0 1 9.34 8.14 1 2.00 7.0 14.00 29 ▇▅▂▂▂
Education 0 1 2.90 1.02 1 2.00 3.0 4.00 5 ▂▅▇▆▁
EmployeeNumber 0 1 1029.83 604.79 1 477.25 1039.0 1561.50 2064 ▇▇▇▇▇
EnvironmentSatisfaction 0 1 2.70 1.10 1 2.00 3.0 4.00 4 ▅▆▁▇▇
HourlyRate 0 1 65.61 20.13 30 48.00 66.0 83.00 100 ▇▇▆▇▇
JobInvolvement 0 1 2.72 0.70 1 2.00 3.0 3.00 4 ▁▃▁▇▁
JobLevel 0 1 2.04 1.09 1 1.00 2.0 3.00 5 ▇▇▃▂▁
JobSatisfaction 0 1 2.71 1.11 1 2.00 3.0 4.00 4 ▅▅▁▇▇
MonthlyIncome 0 1 6390.26 4597.70 1081 2839.50 4945.5 8182.00 19999 ▇▅▂▁▁
MonthlyRate 0 1 14325.62 7108.38 2094 8092.00 14074.5 20456.25 26997 ▇▇▇▇▇
NumCompaniesWorked 0 1 2.73 2.52 0 1.00 2.0 4.00 9 ▇▃▂▂▁
PercentSalaryHike 0 1 15.20 3.68 11 12.00 14.0 18.00 25 ▇▅▃▂▁
PerformanceRating 0 1 3.15 0.36 3 3.00 3.0 3.00 4 ▇▁▁▁▂
RelationshipSatisfaction 0 1 2.71 1.10 1 2.00 3.0 4.00 4 ▅▅▁▇▇
StockOptionLevel 0 1 0.78 0.86 0 0.00 1.0 1.00 3 ▇▇▁▂▁
TotalWorkingYears 0 1 11.05 7.51 0 6.00 10.0 15.00 40 ▇▇▂▁▁
TrainingTimesLastYear 0 1 2.83 1.27 0 2.00 3.0 3.00 6 ▂▇▇▂▃
WorkLifeBalance 0 1 2.78 0.71 1 2.00 3.0 3.00 4 ▁▃▁▇▂
YearsAtCompany 0 1 6.96 6.02 0 3.00 5.0 10.00 40 ▇▃▁▁▁
YearsInCurrentRole 0 1 4.20 3.64 0 2.00 3.0 7.00 18 ▇▃▂▁▁
YearsSinceLastPromotion 0 1 2.17 3.19 0 0.00 1.0 3.00 15 ▇▁▁▁▁
YearsWithCurrManager 0 1 4.14 3.57 0 2.00 3.0 7.00 17 ▇▂▅▁▁

Running skim() again on the new dataset confirms that all three columns have been dropped.
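One caveat with negative indexing as in rawdata[,-drop_columns]: if no column is constant, the index is empty and R returns a data frame with no columns at all. A possible safer variant uses a logical mask instead. The helper name drop_constant_cols and the toy data below are hypothetical, for illustration only.

```r
# Hypothetical helper: keep only columns with more than one unique value.
# A logical mask is safe even when no column is constant.
drop_constant_cols <- function(df) {
  is_constant <- vapply(df, function(x) length(unique(x)) == 1, logical(1))
  df[, !is_constant, drop = FALSE]
}

# Toy data with two constant columns (made-up values).
toy <- data.frame(
  Age           = c(32, 40, 35),
  Over18        = c("Y", "Y", "Y"),
  StandardHours = c(80, 80, 80)
)

names(drop_constant_cols(toy))         # "Age"

# Nothing to drop: the data frame comes back unchanged.
names(drop_constant_cols(toy["Age"]))  # "Age"
```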

I also want to drop the columns ID and EmployeeNumber. These variables only identify individual employees; they are not related to salary or attrition and are not useful for our analysis. After dropping them, I will run skim() again to check the dataset.

rawdata <- select(rawdata, -c("ID","EmployeeNumber"))
skim(rawdata)
Data summary
Name rawdata
Number of rows 870
Number of columns 31
_______________________
Column type frequency:
factor 8
numeric 23
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Attrition 0 1 FALSE 2 No: 730, Yes: 140
BusinessTravel 0 1 FALSE 3 Tra: 618, Tra: 158, Non: 94
Department 0 1 FALSE 3 Res: 562, Sal: 273, Hum: 35
EducationField 0 1 FALSE 6 Lif: 358, Med: 270, Mar: 100, Tec: 75
Gender 0 1 FALSE 2 Mal: 516, Fem: 354
JobRole 0 1 FALSE 9 Sal: 200, Res: 172, Lab: 153, Man: 87
MaritalStatus 0 1 FALSE 3 Mar: 410, Sin: 269, Div: 191
OverTime 0 1 FALSE 2 No: 618, Yes: 252

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1 36.83 8.93 18 30.0 35.0 43.00 60 ▂▇▇▃▂
DailyRate 0 1 815.23 401.12 103 472.5 817.5 1165.75 1499 ▇▇▇▇▇
DistanceFromHome 0 1 9.34 8.14 1 2.0 7.0 14.00 29 ▇▅▂▂▂
Education 0 1 2.90 1.02 1 2.0 3.0 4.00 5 ▂▅▇▆▁
EnvironmentSatisfaction 0 1 2.70 1.10 1 2.0 3.0 4.00 4 ▅▆▁▇▇
HourlyRate 0 1 65.61 20.13 30 48.0 66.0 83.00 100 ▇▇▆▇▇
JobInvolvement 0 1 2.72 0.70 1 2.0 3.0 3.00 4 ▁▃▁▇▁
JobLevel 0 1 2.04 1.09 1 1.0 2.0 3.00 5 ▇▇▃▂▁
JobSatisfaction 0 1 2.71 1.11 1 2.0 3.0 4.00 4 ▅▅▁▇▇
MonthlyIncome 0 1 6390.26 4597.70 1081 2839.5 4945.5 8182.00 19999 ▇▅▂▁▁
MonthlyRate 0 1 14325.62 7108.38 2094 8092.0 14074.5 20456.25 26997 ▇▇▇▇▇
NumCompaniesWorked 0 1 2.73 2.52 0 1.0 2.0 4.00 9 ▇▃▂▂▁
PercentSalaryHike 0 1 15.20 3.68 11 12.0 14.0 18.00 25 ▇▅▃▂▁
PerformanceRating 0 1 3.15 0.36 3 3.0 3.0 3.00 4 ▇▁▁▁▂
RelationshipSatisfaction 0 1 2.71 1.10 1 2.0 3.0 4.00 4 ▅▅▁▇▇
StockOptionLevel 0 1 0.78 0.86 0 0.0 1.0 1.00 3 ▇▇▁▂▁
TotalWorkingYears 0 1 11.05 7.51 0 6.0 10.0 15.00 40 ▇▇▂▁▁
TrainingTimesLastYear 0 1 2.83 1.27 0 2.0 3.0 3.00 6 ▂▇▇▂▃
WorkLifeBalance 0 1 2.78 0.71 1 2.0 3.0 3.00 4 ▁▃▁▇▂
YearsAtCompany 0 1 6.96 6.02 0 3.0 5.0 10.00 40 ▇▃▁▁▁
YearsInCurrentRole 0 1 4.20 3.64 0 2.0 3.0 7.00 18 ▇▃▂▁▁
YearsSinceLastPromotion 0 1 2.17 3.19 0 0.0 1.0 3.00 15 ▇▁▁▁▁
YearsWithCurrManager 0 1 4.14 3.57 0 2.0 3.0 7.00 17 ▇▂▅▁▁

The dataset now has 31 columns.
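Before dropping identifier-like columns, one could confirm that they really take a distinct value in every row. The sketch below is one way to flag such columns, assuming a hypothetical helper is_identifier and made-up values. It is only a hint, not a rule: a continuous variable such as an income column can also happen to be unique per row.

```r
# Hypothetical helper: TRUE when every value in the column is distinct.
is_identifier <- function(x) anyDuplicated(x) == 0

# Toy data (made-up values) with two identifier-like columns.
toy <- data.frame(
  ID             = 1:4,
  EmployeeNumber = c(859, 1128, 1412, 2016),
  JobLevel       = c(2, 5, 3, 3)
)

# Names of the columns whose values never repeat.
id_like <- names(toy)[vapply(toy, is_identifier, logical(1))]
id_like  # "ID" "EmployeeNumber"
```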

5. Pre-processing the data:

The following five rating variables are stored as numbers but represent categories, so I will convert them from numeric variables to factor variables.

factorcolumns <- c("JobInvolvement", "JobSatisfaction", "PerformanceRating", "RelationshipSatisfaction", "WorkLifeBalance")

rawdata[,factorcolumns] <- lapply(rawdata[,factorcolumns], as.factor)
data0 <- rawdata #data0 - dataset that I use for the analysis
skim(data0)
Data summary
Name data0
Number of rows 870
Number of columns 31
_______________________
Column type frequency:
factor 13
numeric 18
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Attrition 0 1 FALSE 2 No: 730, Yes: 140
BusinessTravel 0 1 FALSE 3 Tra: 618, Tra: 158, Non: 94
Department 0 1 FALSE 3 Res: 562, Sal: 273, Hum: 35
EducationField 0 1 FALSE 6 Lif: 358, Med: 270, Mar: 100, Tec: 75
Gender 0 1 FALSE 2 Mal: 516, Fem: 354
JobInvolvement 0 1 FALSE 4 3: 514, 2: 228, 4: 81, 1: 47
JobRole 0 1 FALSE 9 Sal: 200, Res: 172, Lab: 153, Man: 87
JobSatisfaction 0 1 FALSE 4 4: 271, 3: 254, 1: 179, 2: 166
MaritalStatus 0 1 FALSE 3 Mar: 410, Sin: 269, Div: 191
OverTime 0 1 FALSE 2 No: 618, Yes: 252
PerformanceRating 0 1 FALSE 2 3: 738, 4: 132
RelationshipSatisfaction 0 1 FALSE 4 4: 264, 3: 261, 1: 174, 2: 171
WorkLifeBalance 0 1 FALSE 4 3: 532, 2: 192, 4: 98, 1: 48

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1 36.83 8.93 18 30.0 35.0 43.00 60 ▂▇▇▃▂
DailyRate 0 1 815.23 401.12 103 472.5 817.5 1165.75 1499 ▇▇▇▇▇
DistanceFromHome 0 1 9.34 8.14 1 2.0 7.0 14.00 29 ▇▅▂▂▂
Education 0 1 2.90 1.02 1 2.0 3.0 4.00 5 ▂▅▇▆▁
EnvironmentSatisfaction 0 1 2.70 1.10 1 2.0 3.0 4.00 4 ▅▆▁▇▇
HourlyRate 0 1 65.61 20.13 30 48.0 66.0 83.00 100 ▇▇▆▇▇
JobLevel 0 1 2.04 1.09 1 1.0 2.0 3.00 5 ▇▇▃▂▁
MonthlyIncome 0 1 6390.26 4597.70 1081 2839.5 4945.5 8182.00 19999 ▇▅▂▁▁
MonthlyRate 0 1 14325.62 7108.38 2094 8092.0 14074.5 20456.25 26997 ▇▇▇▇▇
NumCompaniesWorked 0 1 2.73 2.52 0 1.0 2.0 4.00 9 ▇▃▂▂▁
PercentSalaryHike 0 1 15.20 3.68 11 12.0 14.0 18.00 25 ▇▅▃▂▁
StockOptionLevel 0 1 0.78 0.86 0 0.0 1.0 1.00 3 ▇▇▁▂▁
TotalWorkingYears 0 1 11.05 7.51 0 6.0 10.0 15.00 40 ▇▇▂▁▁
TrainingTimesLastYear 0 1 2.83 1.27 0 2.0 3.0 3.00 6 ▂▇▇▂▃
YearsAtCompany 0 1 6.96 6.02 0 3.0 5.0 10.00 40 ▇▃▁▁▁
YearsInCurrentRole 0 1 4.20 3.64 0 2.0 3.0 7.00 18 ▇▃▂▁▁
YearsSinceLastPromotion 0 1 2.17 3.19 0 0.0 1.0 3.00 15 ▇▁▁▁▁
YearsWithCurrManager 0 1 4.14 3.57 0 2.0 3.0 7.00 17 ▇▂▅▁▁

The dataset now has 13 factor columns and 18 numeric columns.
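The conversion above uses as.factor(), which produces unordered factors (skim() shows ordered = FALSE). Since these ratings have a natural order, an alternative is to create ordered factors. This is a sketch of that alternative, not what the analysis above does; the toy values are made up, and the 1 to 4 scale is an assumption based on the ranges in the skim() output.

```r
# Toy ratings (made-up values).
toy <- data.frame(
  JobSatisfaction = c(4, 3, 1, 2, 4),
  WorkLifeBalance = c(2, 4, 3, 3, 1)
)

rating_cols <- c("JobSatisfaction", "WorkLifeBalance")

# Fix the levels explicitly so every column shares the same scale,
# even if some rating never occurs in the data.
toy[rating_cols] <- lapply(toy[rating_cols], factor,
                           levels = 1:4, ordered = TRUE)

is.ordered(toy$JobSatisfaction)  # TRUE

# Ordered factors support comparisons against a level.
toy$JobSatisfaction >= "3"

# Counts per level, including levels with zero observations.
table(toy$WorkLifeBalance)
```

Keep in mind that some modeling functions treat ordered factors differently from unordered ones (for example, by using polynomial contrasts), so the choice can affect later results.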

In the next part, I will do Exploratory Data Analysis (EDA), starting with an analysis of each variable individually.